2017-9-28 The Zen of Python Encoding

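Background:

Encoding errors, decoding errors, and garbled text come up all the time in day-to-day work. They are as irritating as the flyers pasted on every street corner, so it is worth taking the time to sort them out properly and be rid of them once and for all.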

<!--more-->

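str(s) and unicode(s)

In Python 2, str(s) and unicode(s) are two factory functions that return a str object and a unicode object, respectively. When s is a unicode object, str(s) is shorthand for s.encode('ascii'); when s is a byte str, unicode(s) is shorthand for s.decode('ascii'). Both rely on the default encoding, which is ASCII.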

str
str(object='')
str(object=b'', encoding='utf-8', errors='strict')
(The second form is the Python 3 signature; Python 2's str() takes only the object.)

object - the object whose informal string representation is returned
encoding - encoding of the given object; defaults to UTF-8
errors - how to respond when a conversion fails. There are six error handlers:
    strict - the default; raises a UnicodeDecodeError (or UnicodeEncodeError when encoding) on failure
    ignore - drops the characters that cannot be converted
    replace - when encoding, replaces each unencodable character with a question mark ?
    xmlcharrefreplace - inserts an XML character reference in place of each unencodable character
    backslashreplace - inserts a \uNNNN escape sequence in place of each unencodable character
    namereplace - inserts a \N{...} escape sequence in place of each unencodable character
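
To make the handler list above concrete, here is a minimal sketch that encodes the same two characters to ASCII under each handler. It assumes Python 3 (3.5 or later for namereplace), unlike the Python 2 sessions in the rest of this post, and the helper name encode_ascii is made up for illustration.

```python
# Python 3 sketch; the interactive sessions in this post are Python 2.
text = "\u4f60\u597d"  # the two characters 你好, neither of which exists in ASCII

HANDLERS = ("strict", "ignore", "replace",
            "xmlcharrefreplace", "backslashreplace", "namereplace")


def encode_ascii(s, errors):
    """Hypothetical helper: encode s as ASCII, or return None if it raises."""
    try:
        return s.encode("ascii", errors=errors)
    except UnicodeEncodeError:
        return None


# Map each handler name to what it produced (None means an exception).
results = {name: encode_ascii(text, name) for name in HANDLERS}

for name in HANDLERS:
    print(name, results[name])
```

Only strict raises; every other handler returns some ASCII-only byte string, at the cost of losing or escaping the original characters.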


>>> s3 = u"你好" 
>>> s3 
u'\u4f60\u597d' 
>>> str(s3) 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

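Here s3 is a unicode string, so str(s3) is equivalent to s3.encode('ascii'). The two characters "你好" cannot be represented in ASCII, hence the error. Naming an encoding that can represent them, s3.encode('gbk') or s3.encode("utf-8"), avoids the problem.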
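
The explanation above can be sketched as follows. This is Python 3, where the text type is str and encoding is always explicit, so it is an analogue of the Python 2 session rather than a replay of it; the variable names are only illustrative.

```python
# Python 3 sketch: encoding text with an explicit codec.
s3 = "\u4f60\u597d"  # 你好

# ASCII has no code points above 127, so this raises just like str(s3) in Python 2.
try:
    s3.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = str(exc)

# Codecs that cover Chinese characters succeed, each with its own byte layout.
gbk_bytes = s3.encode("gbk")
utf8_bytes = s3.encode("utf-8")

print(ascii_error)
print(gbk_bytes)
print(utf8_bytes)
```

GBK uses two bytes per character here and UTF-8 uses three, which is why the two results differ in length.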

unicode() fails in a similar way:
>>> s4 = "你好" 
>>> unicode(s4) 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128) 

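unicode(s4) is equivalent to s4.decode('ascii'), so a correct conversion has to name the encoding the bytes are actually in: s4.decode('gbk') or s4.decode("utf-8"), whichever matches the source of the bytes.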
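
A small sketch of the decoding side, again in Python 3 as an assumption. The leading 0xc4 byte in the traceback above is what GBK produces for 你, so the example starts from GBK bytes; try_decode is a hypothetical helper, not a library function.

```python
# Python 3 sketch: decoding bytes only works with the codec that produced them.
s4 = b"\xc4\xe3\xba\xc3"  # the GBK bytes of 你好


def try_decode(raw, codec):
    """Hypothetical helper: decode raw with codec, or return None if it raises."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None


as_ascii = try_decode(s4, "ascii")  # fails: 0xc4 is outside range(128)
as_utf8 = try_decode(s4, "utf-8")   # fails: these bytes are not valid UTF-8
as_gbk = try_decode(s4, "gbk")      # succeeds: matches how the bytes were made

print(as_ascii, as_utf8, as_gbk)
```

The takeaway is that the choice between 'gbk' and "utf-8" is not free: it is dictated by where the bytes came from, such as the terminal or file encoding.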

###

In [20]: '中文'
Out[20]: '\xe4\xb8\xad\xe6\x96\x87'

In [21]: u'中文'
Out[21]: u'\u4e2d\u6587'

In [29]: print '中文'
中文

In [30]: print u'中文'
中文

In [34]: print '\u4e2d\u6587'
\u4e2d\u6587

In [35]: print u'\u4e2d\u6587'
中文

In [26]: u'中文'.encode('gb2312')
Out[26]: '\xd6\xd0\xce\xc4'

In [27]: u'中文'.encode('gbk')
Out[27]: '\xd6\xd0\xce\xc4'

In [28]: u'中文'.encode('utf8')
Out[28]: '\xe4\xb8\xad\xe6\x96\x87'

Problem:

In [41]: '中文'.encode('utf8')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-41-94bb800b6371> in <module>()
----> 1 '中文'.encode('utf8')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

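Solution:

# Python 2 only. site.py deletes sys.setdefaultencoding at startup,
# so reload(sys) is needed to restore it before it can be called.
import sys
reload(sys)
# Implicit str <-> unicode conversions now use UTF-8 instead of ASCII.
sys.setdefaultencoding('utf-8')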
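
For comparison, the failing call can also be written with the hidden decode step spelled out, which does not depend on the process-wide default. This is only a sketch under two assumptions: it uses Python 3, and it takes the bytes to be UTF-8, as Out[20] suggests.

```python
# Python 3 sketch: make the decode step explicit instead of relying on a default.
raw = b"\xe4\xb8\xad\xe6\x96\x87"  # the UTF-8 bytes of 中文, as in Out[20]

# Step 1: bytes -> text, naming the codec the bytes are actually in.
text = raw.decode("utf-8")

# Step 2: text -> bytes in whatever target encoding is needed.
utf8_again = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

print(repr(text), utf8_again, gbk_bytes)
```

In Python 2, calling .encode() on a byte str performs step 1 implicitly with the default codec, which is why the traceback reports a decode error from the 'ascii' codec even though the code asked to encode.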